Abstract

In Natural Language Processing (NLP) tasks, data often have the following two properties. First, data can be split into multiple views, which has been used successfully for dimension reduction. For example, in topic classification, every paper can be split into the title, the main text and the references. However, it is common that some views are less noisy than others for supervised learning problems. Second, unlabeled data are easy to obtain while labeled data are relatively rare. For example, articles that appeared in the New York Times over the last 10 years are easy to collect, but classifying them as 'Politics', 'Finance' or 'Sports' requires human labor. Hence less noisy features are preferred before running supervised learning methods. In this paper we propose an unsupervised algorithm which optimally weights features from different views when these views are generated from a low dimensional hidden state, which occurs in widely used models such as the mixture Gaussian model, the Hidden Markov Model (HMM) and Latent Dirichlet Allocation (LDA).

1 Introduction

In areas like Natural Language Processing, data are often multi-view and high dimensional. Recently, CCA [8] has been applied to the multi-view setting as an unsupervised dimension reduction method in [7] [10] [3], with performance guarantees if the data are generated under a certain structure. In [7], the authors assume that the high dimensional multi-view data are generated independently conditioning on a low dimensional hidden state (the model structure will be illustrated later in detail). Under this assumption, the low dimensional features provided by CCA do not lose any useful information compared with the original high dimensional features when applied to linear regression. Also, [5] applied this CCA method to generate low dimensional vector representations of words, which work well in many NLP tasks.

The reason CCA works well is that the low dimensional hidden state (throughout the paper we use $k$ to denote the dimension of the hidden state) contains most of the information for the supervised tasks, and by doing CCA we are able to generate a $k$ dimensional estimate of the hidden state from each view, as mentioned by [4]. More precisely, via CCA we can find all $k$ directions in the high dimensional space of each view that have non-zero correlation with the hidden state.

Two views are enough to implement the CCA algorithms above (see [7] for a detailed introduction to CCA). Despite its power in dimension reduction, CCA with two views is still not optimal, in the sense that it ends up with one hidden state estimator from each view, but it is impossible to tell which view is better by only looking at the two views. Here is a simple example:

Example 1. Let $h_0 \sim N(0, 1)$ be the hidden state. Conditioning on the hidden state, two views are generated independently with $v_1 \mid h_0 \sim N(h_0, 0.1)$ and $v_2 \mid h_0 \sim N(h_0, 10)$. Clearly $v_1$ is much better than $v_2$ if we want to estimate the hidden state, since it is less noisy. However, since the only data we have are the two views, we cannot figure out which view is more helpful in estimating the hidden state.

A similar situation arises in [5], where the authors have three views (the previous context, the current word, the latter context) and end up with three hidden state estimators.

This problem can be solved if we have three or more views. In fact, recent results have shown that more delicate problems can be solved if three or more views are available. [9] and [13] show that we are able to compute sequential probabilities and conditional probabilities of an HMM with simple empirical statistics calculated from three consecutive observations. [1] and [2] proved that we are able to recover the emission matrix of mixture models with spectral methods when three different views of the data are available. In this paper, we propose an algorithm in which the hidden state estimators coming from the three views are optimally combined to get a cleaner estimator of the hidden state, in the sense that all other directions in the space are uncorrelated with the hidden state.
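As a rough scalar illustration of why a third view helps, the sketch below extends Example 1. It rests on assumptions that are not part of the paper: each view is modeled as $v_i = a_i h_0 + \epsilon_i$ with an unknown scale $a_i$ (here all $a_i = 1$), and a hypothetical third view with noise variance 0.5 is added. Pairwise covariances then satisfy $c_{ij} = a_i a_j$, so $a_1^2 = c_{12} c_{13} / c_{23}$, and the squared correlation of each view with the hidden state can be estimated without observing $h_0$. With two views only the product $a_1 a_2$ is observed, which is not enough to separate the views.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400_000

h0 = rng.standard_normal(n)                  # hidden state, never used by the estimator
noise_var = np.array([0.1, 10.0, 0.5])       # third entry: hypothetical extra view
views = h0[:, None] + rng.standard_normal((n, 3)) * np.sqrt(noise_var)

C = np.cov(views, rowvar=False)              # 3 x 3 empirical covariance of the views


def squared_corr_with_hidden(C, i):
    """Estimate Corr(v_i, h0)^2 from view covariances only (needs three views)."""
    j, l = [m for m in range(3) if m != i]
    signal = C[i, j] * C[i, l] / C[j, l]     # estimates a_i^2
    return signal / C[i, i]


rho2 = np.array([squared_corr_with_hidden(C, i) for i in range(3)])
print(rho2)  # the first view should score far higher than the second
```

The same idea, applied to $k$ dimensional views through CCA rather than to scalars, is what the three view algorithm of Section 3 exploits.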

The paper is organized as follows. In Section 2, a formal mathematical statement of the problem and a short proof of the two view dimension reduction algorithm are given as a warm up. In Section 3, the three view algorithm is stated and proofs are given. In Section 4, experiments on simulated data illustrate the correctness and effectiveness of the three view algorithm. Section 5 is a short summary.

2 Preliminary



2.1 Model Set Up 

In the multi-view problem, we have several views $X = (X^1, X^2, \ldots, X^{n_0})$ of the input data, where each $X^i$ is a $d_i \times 1$ random vector, and a target variable $Y$ which needs to be predicted. Taking NLP problems as an example, each view $X^i$ can be the words in one paragraph of an article while $Y$ is the topic. Or, as mentioned in [5] [11], $X^1$ is the previous context of a word, $X^2$ is the current word, $X^3$ is the latter context and $Y$ is a property of the current word. One key structure of our model, which connects the response $Y$ and the multi-view features $X$, is the hidden state:

Assumption 1. (Conditional Independence Assumption) Conditioning on a $k$ dimensional hidden state $H$ ($k \ll d_i$ for all $i$), the one dimensional response $Y$ and the three views $X^1, X^2, X^3$ are independent (since our algorithm needs only three views, from now on we assume $X$ has three views).




Figure 1: The model structure: conditioning on the hidden state, the three views $X^1, X^2, X^3$ and the response $Y$ are independent

Moreover, for CCA to work, we need assumptions about the structure of the covariance matrix between each pair of views.

Assumption 2. (Linearity Assumption) $E[Y \mid H]$ and $E[X^i \mid H]$ are all linear in $H$, i.e. $E[X^i \mid H] = M_i H$ and $E[Y \mid H] = M_Y H$ for some $d_i \times k$ matrix $M_i$ and $1 \times k$ matrix $M_Y$.

Assumption 3. (Full Rank Assumption) The matrices $M_i$, $i = 1, 2, 3$, have rank $k$.

Many models fall into this category. One example is the Hidden Markov Model (HMM), which is widely used in NLP [12]. Figure 2 shows an HMM of length 3. Let the transition matrix be $T$ and the observation matrix be $O$, and take $H_1$ as the hidden state; then $E[X^1 \mid H_1] = O H_1$, $E[X^2 \mid H_1] = O T H_1$ and $E[X^3 \mid H_1] = O T^2 H_1$, which are all linear in $H_1$. The Latent Dirichlet Allocation (LDA) model in [5] [T] and multi-view Gaussian models like those in [1] [5] also satisfy our assumptions.
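The HMM claim can be checked numerically on a toy chain. The sketch below assumes a convention the paper does not spell out: hidden states and observations are encoded as indicator vectors, and $T$ and $O$ are column-stochastic, so that `T[i, j]` is the probability of moving from state `j` to state `i`. The sizes `k` and `d` are arbitrary illustrative values.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
k, d = 3, 5   # hypothetical sizes: k hidden states, d observation symbols


def random_column_stochastic(rows, cols):
    A = rng.random((rows, cols))
    return A / A.sum(axis=0, keepdims=True)


T = random_column_stochastic(k, k)   # T[i, j] = P(H_{t+1} = i | H_t = j)
O = random_column_stochastic(d, k)   # O[x, h] = P(X_t = x | H_t = h)

# Linear maps from the indicator vector of H_1 to E[X^t | H_1], t = 1, 2, 3.
M1, M2, M3 = O, O @ T, O @ T @ T

# Brute-force check of M3: sum over all hidden paths h1 -> h2 -> h3.
M3_brute = np.zeros((d, k))
for h1, h2, h3 in product(range(k), repeat=3):
    M3_brute[:, h1] += T[h2, h1] * T[h3, h2] * O[:, h3]

print(np.allclose(M3, M3_brute))
```

Each column of `M1`, `M2`, `M3` is a distribution over observation symbols, which is the matrix $M_i$ of Assumption 2 under this encoding.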




Figure 2: An HMM of length three satisfies Assumptions 1 and 2



2.2 Dimension Reduction with Two Views

In practice, the views $X^i$ are often high dimensional. For example, if the $X^i$ are English words, the dimension of each view is the size of the vocabulary. Another important issue is that in many learning tasks, labeled data are rare while unlabeled data are plentiful. In [6] [11], it is easy to get words with their contexts from the internet, while getting words labeled as 'plants' or 'animals' needs a lot of human effort. These observations lead to unsupervised dimension reduction algorithms, which are described in detail in [7]. Here we briefly go through the two view case as a warm up for the three view situation. For simplicity, assume $H$, $X^1$, $X^2$ have mean 0 and identity covariance, since we can always whiten the views and the hidden state.

Let $\Sigma_{a,b}$ denote the covariance matrix between vectors $a$ and $b$, and let $\Sigma_{i,Y}$ denote the covariance between $X^i$ and $Y$ (so the integer $i$ refers to the $i$-th view). A straightforward consequence of Assumptions 1 and 2 is (Lemma 7 in [7]):

Lemma 1.

$$\Sigma_{1,2} = E[E[X^1 X^{2\top} \mid H]] = M_1 E[H H^\top] M_2^\top = M_1 M_2^\top \tag{1}$$
$$\Sigma_{1,Y} = E[E[X^1 Y^\top \mid H]] = M_1 E[H H^\top] M_Y^\top = M_1 M_Y^\top \tag{2}$$

Since both views have identity covariance, CCA between $X^1$ and $X^2$ reduces to a Singular Value Decomposition (SVD) of the covariance matrix (see [7] and [5] for an introduction to CCA).

Let $\Sigma_{1,2} = U D V^\top$ be the SVD of the covariance matrix (since the variances are identity, it is also the correlation matrix). Let $U_{1:k}$, $V_{1:k}$ be the first $k$ columns of $U$, $V$. By Assumption 3 (with $D$ restricted to its leading $k \times k$ block),

$$\Sigma_{1,2} = M_1 M_2^\top = U_{1:k} D V_{1:k}^\top \tag{3}$$

By the definition of CCA, the canonical variables are $X'^1 = U^\top X^1$ and $X'^2 = V^\top X^2$. [7] (Theorem 3) claims that it suffices to pick the top $k$ canonical variables from each view, i.e. $X'^1_{1:k} = U_{1:k}^\top X^1$ and $X'^2_{1:k} = V_{1:k}^\top X^2$, as features if we predict $Y$ with linear regression.

The reason for their claim lies in two aspects.
First, the covariance between $X'^1_{k+1:d_1}$, the features in view 1 that we throw away, and $Y$ is

$$\Sigma_{X'^1_{k+1:d_1}, Y} = U_{k+1:d_1}^\top M_1 M_Y^\top \tag{4}$$

From equation (3), the range of $M_1$ is the same as the range of $U_{1:k}$, hence the columns of $U_{k+1:d_1}$ are orthogonal to the columns of $M_1$. Together with (4), $\Sigma_{X'^1_{k+1:d_1}, Y} = 0$. Similarly, $\Sigma_{X'^2_{k+1:d_2}, Y} = 0$. In other words, the directions (or features) we dropped with CCA are uncorrelated with our target variable $Y$.
Second, we have the following lemma for linear regression:

Lemma 2. Suppose we have two groups of features $(Z_1, Z_2)$ and want to predict $Y$ linearly with $(Z_1, Z_2)$. Suppose the covariance matrices satisfy

$$\Sigma_{Y, Z_2} = 0, \qquad \Sigma_{Z_1, Z_2} = 0.$$

Then the optimal linear predictor (in terms of the square loss) with $Z_1$ is the same as the optimal linear predictor with $(Z_1, Z_2)$.

Proof. Consider the Hilbert space of random variables where covariance is the inner product. The optimal linear predictor with $(Z_1, Z_2)$ is the projection of $Y$ onto their linear span. Our assumption means that $Y$ is perpendicular to the span of $Z_2$ ($Y$ has zero covariance with $Z_2$, and covariance is the inner product) and that the span of $Z_1$ is perpendicular to the span of $Z_2$, so the projection of $Y$ onto the span of $(Z_1, Z_2)$ is the same as the projection of $Y$ onto the span of $Z_1$. □

Let $Z_1 = (X'^1_{1:k}, X'^2_{1:k})$ and $Z_2 = (X'^1_{k+1:d_1}, X'^2_{k+1:d_2})$; this partition satisfies Lemma 2. Therefore the optimal linear predictor with the low dimensional features $(X'^1_{1:k}, X'^2_{1:k})$ is the same as that with $(X^1, X^2)$; in other words, we get dimension reduction from $d_1 + d_2$ to $2k$ for free.
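The two view argument can be verified at the population level, without sampling. The sketch below is a check under illustrative assumptions: the sizes `k`, `d1`, `d2` and the singular values inside `loading` are arbitrary, and the matrices $M_1$, $M_2$ are built with orthonormal columns scaled below 1 so that whitened views with $E[X^i \mid H] = M_i H$ can exist. It checks equation (4) and compares the variance of $Y$ explained by all $d_1 + d_2$ features with that explained by the top $2k$ canonical variables.

```python
import numpy as np

rng = np.random.default_rng(2)
k, d1, d2 = 2, 5, 4   # hypothetical sizes


def loading(d, scales):
    """d x k matrix M with M M^T <= I, so a whitened view can have E[X | H] = M H."""
    B, _ = np.linalg.qr(rng.standard_normal((d, len(scales))))
    return B * scales


M1 = loading(d1, np.array([0.9, 0.6]))
M2 = loading(d2, np.array([0.8, 0.5]))
My = rng.standard_normal((1, k))

S12 = M1 @ M2.T                                   # Lemma 1, equation (1)
Sxx = np.block([[np.eye(d1), S12], [S12.T, np.eye(d2)]])
Sxy = np.vstack([M1 @ My.T, M2 @ My.T])           # Lemma 1, equation (2)

U, D, Vt = np.linalg.svd(S12)                     # CCA for whitened views
dropped_cov = U[:, k:].T @ M1 @ My.T              # equation (4), should vanish


def explained(B):
    """Variance of Y explained by the best linear predictor on features B^T X."""
    Szz, Szy = B.T @ Sxx @ B, B.T @ Sxy
    return (Szy.T @ np.linalg.solve(Szz, Szy)).item()


B_top = np.zeros((d1 + d2, 2 * k))                # keeps the top k directions per view
B_top[:d1, :k] = U[:, :k]
B_top[d1:, k:] = Vt[:k].T

full = explained(np.eye(d1 + d2))
reduced = explained(B_top)
print(full, reduced)
```

Both quantities agree up to rounding, which is the "for free" reduction from $d_1 + d_2$ to $2k$ stated above.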



Remark 1. After doing CCA, we obtain one $k$ dimensional feature from each view, which can be regarded as an estimator of the $k$ dimensional hidden state. In order to estimate some variable $Y$ which is independent of these views conditioning on the hidden state, one can first estimate the hidden state via CCA (unsupervised), then predict $Y$ with the hidden state estimators (supervised). The key property of CCA is that the features thrown away are uncorrelated with $Y$, so it is reasonable to expect the CCA method to work well with other linear learning methods.

3 Optimal Weighting via Three Views



As introduced in the previous section, two view CCA helps reduce the dimension of each view to $k$, the dimension of the hidden state. One drawback of two view CCA is that we get one low dimensional estimator of the hidden state from each view, and these may not be equally informative, as illustrated in Example 1. For instance, the abstract, main content and references can all help classify the topic of a paper, but they are not equally informative. The main contribution of this paper is a way to optimally combine the estimators of the hidden state from each view into a new hidden state estimator when three or more views are available.

Here is the precise statement.
Assume we have three $k$ dimensional views $X^1, X^2, X^3$ (since we can reduce the dimension of each view to $k$ with CCA) and $Y$, generated by a $k$ dimensional hidden state and satisfying Assumptions 1, 2 and 3. Use $X = (X^1; X^2; X^3)$ to denote the concatenation of the three views (so $X$ is a $3k \times 1$ vector). Our goal is to look for a $3k \times k$ matrix $U_1$ such that the optimal linear predictor (in terms of square loss) with the new $k$ dimensional feature $X^* = U_1^\top X$ is the same as the optimal linear predictor with the $3k$ dimensional feature $X$. In other words, $U_1$ optimally combines the hidden state estimators from each view. We still assume everything has mean 0 and the hidden state $H$ has identity covariance.

The following lemma proves the existence of the optimal $k$ dimensional feature $X^*$:

Lemma 3. There exists a $k$ dimensional subspace in the linear span of $X^1, X^2, X^3$ (which is $3k$ dimensional) such that the optimal predictor with this subspace is the same as the optimal predictor with the whole space.

Proof. Do a Canonical Correlation Analysis between the random vectors $H$ and $X$. Let $X'_{1:k}$ denote the first $k$ canonical components of $X$ and $X'_{k+1:3k}$ the rest. Since $H$ is only $k$ dimensional, by the definition of CCA, $\Sigma_{X'_{k+1:3k}, H} = 0$ and $\Sigma_{X'_{1:k}, X'_{k+1:3k}} = 0$.

By Assumption 2, $E[X'_{k+1:3k} \mid H] = M_4 H$ for some $2k \times k$ matrix $M_4$. Since

$$\Sigma_{X'_{k+1:3k}, H} = E[E[X'_{k+1:3k} H^\top \mid H]] = M_4 E[H H^\top] = M_4 = 0 \tag{5}$$

we know $M_4 = 0$. Lemma 1 implies $\Sigma_{X'_{k+1:3k}, Y} = M_4 M_Y^\top = 0$. Applying Lemma 2, the optimal linear predictor with $X'$ (the same as the optimal linear predictor with $X$) is the same as the optimal linear predictor with $X^* = X'_{1:k}$. □

Our algorithm finds the above optimal subspace in a relatively indirect way. In order to illustrate the rationale behind the algorithm, it is helpful to dig a little into the CCA proof of Lemma 3.

Let the rotation matrix on $X$ given by the above CCA be $U_0 = (U_1, U_2)$, with $X'_{1:k} = U_1^\top X$ and $X'_{k+1:3k} = U_2^\top X$. Let $Q = \Sigma_{X,X}^{1/2}$, so that $Q^{-1}$ can be used to whiten $X$ to have identity covariance. Let $X'' = Q^{-1} X$, and let $\Sigma_{X'',H}$ have the full SVD

$$\Sigma_{X'',H} = P D V^\top$$

Since $X''$ and $H$ both have identity covariance, the above SVD actually gives the CCA rotation for the random vectors $X''$ and $H$, i.e. $P^\top X''$ are the canonical variables. Moreover, since $X'' = Q^{-1} X$, we know $U_0 = Q^{-1} P$ is the CCA rotation matrix for $X$. Let $P = (P_1, P_2)$ where $P_1$ denotes the first $k$ columns and $P_2$ denotes the last $2k$ columns; then

$$U_1 = Q^{-1} P_1 \tag{6}$$
$$U_2 = Q^{-1} P_2 \tag{7}$$

Our goal is to look for $U_1$; then we can get the optimal subspace by $X^* = U_1^\top X$. The trick of the algorithm is that we first estimate the column space of $U_2$, which is relatively easy; then we can find the column space of $P_2$ based on (7), since $Q$, as the square root of the covariance of $X$, is easy to estimate. By the properties of the SVD, $P_1 \perp P_2$ (meaning the column spaces of the two matrices are perpendicular), so we can easily reconstruct the column space of $P_1$ from $P_2$ (note that $U_1$ is not perpendicular to $U_2$). Finally, we can find the column space of $U_1$ from $P_1$ and $Q$ by (6).

Based on the above argument, it suffices to find the column space of $U_2$. We need the following lemma:

Lemma 4. Let $a \in \mathbb{R}^{3k \times 1}$ be a direction in $3k$ dimensional space. If $\mathrm{Cov}(a^\top X, b^\top H) = 0$ for every $b \in \mathbb{R}^{k \times 1}$, then $a$ lies in the column space of $U_2$.

Proof. Let $a = c + d$ where $c$ is in the column space of $U_1$ and $d$ is in the column space of $U_2$ (since $U_1, U_2$ span the whole space and have no intersection except 0, this decomposition of $a$ is unique). It suffices to show $c = 0$. Note that

$$\mathrm{Cov}(a^\top X, b^\top H) = \mathrm{Cov}(c^\top X, b^\top H) + \mathrm{Cov}(d^\top X, b^\top H) = \mathrm{Cov}(c^\top X, b^\top H)$$

since $d$ is in the column space of $U_2$. Let $U_1 = (u_1, u_2, u_3, \ldots, u_k)$; since $c$ lies in the column space of $U_1$, $c = \sum_{i=1}^{k} \alpha_i u_i$. Pick $b$ to be any canonical direction of $H$, i.e. any column of $V_0$ (the rotation on $H$ given by the same CCA, which is $V$ in the SVD above); by the assumption of the lemma, $\mathrm{Cov}(c^\top X, b^\top H) = \mathrm{Cov}(a^\top X, b^\top H) = 0$ for all these $b$. Denote $V_0 = (v_1, v_2, \ldots, v_k)$. Moreover, since $u_i, v_j$ are canonical directions, $\mathrm{Cov}(u_i^\top X, v_j^\top H) = 0$ if $i \neq j$. Therefore

$$0 = \mathrm{Cov}\Big(\big(\sum_{i=1}^{k} \alpha_i u_i\big)^\top X, v_j^\top H\Big) = \mathrm{Cov}\big((\alpha_j u_j)^\top X, v_j^\top H\big)$$

for all $j$. This implies $\alpha_j = 0$ for $j = 1, \ldots, k$, since $\mathrm{Cor}(u_j^\top X, v_j^\top H)$ is the $j$-th canonical correlation, which is non-zero. Therefore $c = 0$ and $a = d$ lies in the column space of $U_2$. □



The above lemma shows that in order to find the column space of $U_2$, it suffices to find $2k$ linearly independent directions that satisfy the condition of Lemma 4, which is easy. Run a CCA between the random vectors $X^1$ and $(X^2; X^3)$; we have:

Lemma 5. The last $k$ canonical directions of $(X^2; X^3)$ have correlation matrix 0 with $H$, hence satisfy the condition of Lemma 4.

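Proof. Denote the rotation matrix corresponding to the last $k$ directions by $R_1 \in \mathbb{R}^{2k \times k}$, so that $X^{23} = R_1^\top (X^2; X^3)$ are the last $k$ canonical variables. By Assumption 2, $E[X^{23} \mid H] = M_5 H$ and $E[X^1 \mid H] = M_1 H$. Because $(X^2; X^3)$ is $2k$ dimensional while $X^1$ is only $k$ dimensional, these last $k$ canonical variables have zero correlation with $X^1$, so Lemma 1 gives $\Sigma_{X^{23},1} = M_5 M_1^\top = 0$. Since $M_1$ is $k \times k$ and full rank by Assumption 3, $M_5 = 0$, so $\Sigma_{X^{23},H} = M_5 E[H H^\top] = M_5 = 0$. □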

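Similarly, run a CCA (or Canonical Covariance Analysis) between the random vectors $X^3$ and $(X^1; X^2)$; the last $k$ canonical directions of $(X^1; X^2)$ have correlation matrix 0 with $H$, hence satisfy the condition of Lemma 4. Denote the rotation matrix corresponding to the last $k$ directions by $R_2 \in \mathbb{R}^{2k \times k}$. For notational convenience, let

$$R_1 = \begin{pmatrix} R_{11} \\ R_{21} \end{pmatrix}, \qquad R_2 = \begin{pmatrix} R_{12} \\ R_{22} \end{pmatrix}$$

where all the blocks $R_{ij}$ are $k \times k$. Finally, let $O$ be the $k \times k$ matrix of all zeros, and pad each rotation with $O$ in the rows of the view it does not involve, so that the first $k$ columns act on $(X^2; X^3)$ and the last $k$ columns act on $(X^1; X^2)$:

$$R = \begin{pmatrix} O & R_{12} \\ R_{11} & R_{22} \\ R_{21} & O \end{pmatrix} \tag{8}$$

If $R$ has full rank (which is true in most cases), the column space of $R$ is exactly the column space of $U_2$, since every column of $R$ satisfies the condition of Lemma 4 and the columns form a basis.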
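The construction of this section can be sketched in a few lines of NumPy. This is a minimal sketch rather than the authors' implementation: the function names are hypothetical, the input `S` is the $3k \times 3k$ covariance of $X = (X^1; X^2; X^3)$ (an empirical estimate in practice), and each rotation is zero-padded in the rows of the view it does not involve, so $R_1$ occupies the rows of $(X^2; X^3)$ and $R_2$ the rows of $(X^1; X^2)$. The check at the end uses a population covariance built from a randomly drawn $M = (M_1; M_2; M_3)$ and compares the result with the oracle subspace spanned by $\Sigma_{X,X}^{-1} M$, which is not available in practice.

```python
import numpy as np


def sym_sqrt(S):
    """Symmetric square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(w)) @ V.T


def last_canonical_directions(S, idx_a, idx_b, k):
    """Last k canonical directions of X[idx_a] in a CCA against X[idx_b]."""
    Wa = np.linalg.inv(sym_sqrt(S[np.ix_(idx_a, idx_a)]))
    Wb = np.linalg.inv(sym_sqrt(S[np.ix_(idx_b, idx_b)]))
    P, _, _ = np.linalg.svd(Wa @ S[np.ix_(idx_a, idx_b)] @ Wb, full_matrices=True)
    return Wa @ P[:, -k:]


def three_view_weights(S, k):
    """Return a 3k x k matrix whose columns span the column space of U_1."""
    v1, v2, v3 = (np.arange(i * k, (i + 1) * k) for i in range(3))
    Q = sym_sqrt(S)
    R1 = last_canonical_directions(S, np.r_[v2, v3], v1, k)   # acts on (X^2; X^3)
    R2 = last_canonical_directions(S, np.r_[v1, v2], v3, k)   # acts on (X^1; X^2)
    O = np.zeros((k, k))
    R = np.block([[O, R2[:k]], [R1[:k], R2[k:]], [R1[k:], O]])
    P2 = Q @ R                                                # from equation (7)
    P, _, _ = np.linalg.svd(P2, full_matrices=True)
    P1 = P[:, 2 * k:]                                         # complement of col(P2)
    return np.linalg.solve(Q, P1)                             # equation (6)


def projector(A):
    """Orthogonal projector onto the column space of A."""
    Qa, _ = np.linalg.qr(A)
    return Qa @ Qa.T


rng = np.random.default_rng(3)
k = 4                                           # small illustrative dimension
M = rng.standard_normal((3 * k, k))             # stacked (M_1; M_2; M_3)
S = M @ M.T + np.diag(np.repeat([2.0, 0.5, 0.2], k))   # population covariance of X

U1 = three_view_weights(S, k)
U1_oracle = np.linalg.solve(S, M)               # needs M, so only usable as a check
gap = np.linalg.norm(projector(U1) - projector(U1_oracle))
print(gap)
```

Only the column space of $U_1$ matters for linear prediction, which is why the check compares projectors rather than the matrices themselves.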



Based on the above argument, the algorithm for finding the optimal $k$ dimensional subspace is given in Table 1.

Remark 2. From the dimension reduction point of view, running two view CCA between each pair of views reduces the dimension from $d_1 + d_2 + d_3$ to $3k$, and running the three view algorithm reduces the dimension from $3k$ to $k$. By doing CCA we find a $k$ dimensional subspace in the huge $d_1 + d_2 + d_3$ dimensional space which contains all the useful information for predicting the hidden state $H$ and hence the variable $Y$. In fact this is the optimal unsupervised dimension reduction possible, since the projection (in the Hilbert space of random variables) of the hidden state onto the $d_1 + d_2 + d_3$ dimensional feature space is exactly the $k$ dimensions given by the CCA.

4 Experiments On Simulated Data



In this section the three view algorithm is applied to a normal model. In this model, we have a $k = 10$ dimensional normal hidden state $H$ with mean 0 and identity covariance. Conditional on $H$, each of the three views $X^i$ has a normal distribution with mean $A_i H$ ($A_i \in \mathbb{R}^{k \times k}$) and covariance $\sigma_i I$ ($\sigma_1 = 2$, $\sigma_2 = 0.5$, $\sigma_3 = 0.2$). Our goal is to predict a random variable $Y$. Conditioning on $H$, $Y$ is normal with mean $\beta H$ ($\beta \in \mathbb{R}^{1 \times k}$) and variance $\sigma = 0.5$ ($A_i$ and $\beta$ are generated at random).
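A sampler for this model could look as follows. It is a sketch under stated assumptions: the function name `sample_normal_model` is hypothetical, the values $\sigma_i$ and $\sigma$ are read literally as variances, and $A_i$ and $\beta$ are drawn with i.i.d. standard normal entries, since the text only says they are generated at random.

```python
import numpy as np


def sample_normal_model(n, A, beta, view_var=(2.0, 0.5, 0.2), y_var=0.5, rng=None):
    """Draw n samples (H, [X1, X2, X3], Y) from the simulated normal model."""
    rng = np.random.default_rng() if rng is None else rng
    k = A[0].shape[1]
    H = rng.standard_normal((n, k))                          # H ~ N(0, I_k)
    views = [H @ A_i.T + np.sqrt(v) * rng.standard_normal((n, A_i.shape[0]))
             for A_i, v in zip(A, view_var)]                 # X^i | H ~ N(A_i H, v I)
    Y = H @ beta.ravel() + np.sqrt(y_var) * rng.standard_normal(n)
    return H, views, Y


rng = np.random.default_rng(4)
k = 10
A = [rng.standard_normal((k, k)) for _ in range(3)]          # assumed Gaussian entries
beta = rng.standard_normal((1, k))
H, (X1, X2, X3), Y = sample_normal_model(50_000, A, beta, rng=rng)
print(X1.shape, X2.shape, X3.shape, Y.shape)
```

The hidden state `H` is returned only for diagnostics; the unsupervised algorithm sees the three views alone.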

In the first experiment, we compare three groups of features. The first group is all three views $X = (X^1, X^2, X^3)$ (denoted as $S_1$). The second is the $k$ dimensional feature $U_1^\top X$ obtained by our algorithm (denoted as $S_2$). The third is also a $k$ dimensional feature, obtained by simply averaging the three views, i.e. $X^1 + X^2 + X^3$ (denoted as $S_3$). We want to compare the square loss of the optimal predictor with each of the three feature sets, therefore we run a regression with a large amount of labeled data (5000) to make sure our linear predictors converge to the optimal ones. This experiment is repeated 100 times (using 100 different rotation matrices $A_i$).
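For intuition about what this comparison should show in the large sample limit, the optimal square loss of each feature set can be computed directly from population covariances. The sketch below is not a reproduction of the experiment: it assumes Gaussian entries for $A_i$ and $\beta$, reads $\sigma_i$ and $\sigma$ as variances, and replaces the learned $U_1$ by the oracle $\Sigma_{X,X}^{-1} M$, which spans the same subspace when the algorithm is given the exact covariance.

```python
import numpy as np

rng = np.random.default_rng(5)
k = 10
A = [rng.standard_normal((k, k)) for _ in range(3)]   # assumed Gaussian entries
beta = rng.standard_normal((1, k))
view_var, y_var = (2.0, 0.5, 0.2), 0.5

M = np.vstack(A)                                      # Cov(X, H), since Cov(H) = I
Sxx = M @ M.T + np.diag(np.repeat(view_var, k))       # Cov(X)
Sxy = M @ beta.T                                      # Cov(X, Y)
var_y = (beta @ beta.T).item() + y_var


def optimal_loss(B):
    """Population square loss of the best linear predictor of Y from B^T X."""
    Szz, Szy = B.T @ Sxx @ B, B.T @ Sxy
    return var_y - (Szy.T @ np.linalg.solve(Szz, Szy)).item()


loss_s1 = optimal_loss(np.eye(3 * k))                 # S1: all three views
loss_s2 = optimal_loss(np.linalg.solve(Sxx, M))       # S2: oracle version of U_1
loss_s3 = optimal_loss(np.vstack([np.eye(k)] * 3))    # S3: X^1 + X^2 + X^3
print(loss_s1, loss_s2, loss_s3)
```

Under these assumptions `loss_s1` and `loss_s2` coincide up to rounding, while the unweighted combination `loss_s3` is strictly worse.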

Figure 3 shows the square loss of the optimal predictor of $Y$ with the three groups of features. The Y axis is the square loss while the X axis indicates different trials. The left panel of Figure 3 shows the square loss of $S_1$ and $S_2$ ($S_2$ is learned with 50000 unlabeled data), and the right panel shows the square loss of $S_2$ and $S_3$. The square losses of $S_1$ and $S_2$ are close most of the time, while the square loss of $S_3$ is much larger. Figure 4 shows the histogram of the optimal square loss ratios for these 100 trials. The left figure is $\frac{\text{square loss of } S_2}{\text{square loss of } S_1}$ and the right figure is $\frac{\text{square loss of } S_3}{\text{square loss of } S_1}$. In most cases $\frac{\text{square loss of } S_2}{\text{square loss of } S_1}$ is distributed very close to 1, i.e. the optimal square losses of $S_1$ and $S_2$ are almost the same, while in most cases the optimal square loss of $S_3$ is much larger than that of $S_1$.
The second experiment is about the sample size. We run the three view algorithm on different amounts of unlabeled $X$ to obtain $S_2$ (the sample sizes of Groups 1 to 7 are 500, 1000, 2000, 4000, 8000, 10000, 20000). For each group, we run 100 experiments and box plot the optimal square loss of $S_2$.

Algorithm: Optimal Weighting of Three Views

Step 1. Estimate the $3k \times 3k$ covariance matrix $\Sigma_{X,X}$ empirically, and compute $Q$ as the square root of $\Sigma_{X,X}$.
Step 2. Perform CCA between $X^1$ and $(X^2, X^3)$ to obtain the rotation matrix $R_1$. Perform CCA between $X^3$ and $(X^1, X^2)$ to obtain the rotation matrix $R_2$.
Step 3. Construct $R$ from $R_1$, $R_2$ with equation (8).
Step 4. Compute $P_2 = QR$.
Step 5. Compute $P_1$ by finding the orthogonal complement of $P_2$.
Step 6. Compute $U_1 = Q^{-1} P_1$.

$U_1$ is the matrix which projects $X$ onto the optimal $k$ dimensional subspace.

Table 1: Finding Optimal k Dimensional Subspace with Three Views

Figure 3: The square loss of the optimal linear predictor using three different feature sets

Figure 5 shows the optimal square loss of $S_2$ for different sample sizes. The dashed line at about y = 0.256 is the average optimal square loss of the $3k$ dimensional feature set $S_1$, i.e. the asymptotic optimum if the sample size is large enough. Our algorithm performs better as the sample size increases. When the sample size is about 20000 (Group 7), the square loss of $S_2$ becomes close to the square loss of $S_1$.

The third experiment shows the advantage of our three view algorithm when the amount of labeled data is limited. We still consider predicting $Y$ with linear regression. As we know, the square loss of regression can be decomposed into bias and variance. In Section 3 and the first experiment it is shown that the dimension reduction of our three view algorithm does not introduce any bias. Moreover, the variance is reduced due to the reduced dimensionality. In the third experiment, we compare the square loss of predicting $Y$ with $S_1$ and $S_2$ ($S_2$ is learned with 50000 unlabeled data). Four groups of experiments are performed with different amounts of labeled data (the labeled data sizes are 40, 80, 150, 400; the dimension of $S_1$ is 30 and the dimension of $S_2$ is 10). In each group, 25 different sets of model parameters (different $A_i$ and $\beta$) are randomly generated, and for each parameter setup we estimate the square loss by simulation.

The square losses of the 25 parameter setups in each group are box plotted in Figure 6 (the labeled data size increases from left to right). When labeled data are scarce, our three view feature $S_2$ outperforms the original feature $S_1$, and the difference becomes smaller when more labeled data are available.

Figure 4: The histogram of optimal square loss ratio of different feature sets



5 Summary 

We have seen how CCA can be applied for dimension reduction and optimal weighting in the multi-view model with a hidden state, which is assumed to carry most of the information for supervised learning problems. After doing CCA, we end up with a $k$ dimensional feature space which achieves optimal dimension reduction. This dimension reduction method works very well when a huge amount of unlabeled data is available while labeled data are limited. If more than three views are available, we only need to group the views into three disjoint parts, and these three parts can act as the three views in our algorithm.